refactor(ir): remove the decimal precision promotion logic #8195

chelsea-lin · 2024-02-01T23:35:20Z

Description of changes

This change removes the decimal precision promotion logic from unbound ibis expressions. The logic was introduced by #4330 so that the decimal precision would match DB2's. However, this rule does not hold for every backend. For example, BQ only supports precision=38.

Issues closed

Resolves bug(bigquery): Numeric precision increase 1 unexpectedly after every sub/add operations #8189
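
For illustration, the symptom described in #8189 can be sketched in pure Python. This assumes a SQL-style addition rule of the kind this PR removes (the wider integer part, plus the larger scale, plus one carry digit); `promote_add` and `BQ_MAX_PRECISION` are hypothetical names rather than ibis APIs, and the exact rule that lived in ibis/expr/rules.py may differ.

```python
from typing import Tuple

# Assumption: the BigQuery precision limit mentioned in the description.
BQ_MAX_PRECISION = 38


def promote_add(lhs: Tuple[int, int], rhs: Tuple[int, int]) -> Tuple[int, int]:
    """Return (precision, scale) of lhs + rhs under the assumed rule."""
    (lp, ls), (rp, rs) = lhs, rhs
    scale = max(ls, rs)
    # Integer digits of the wider operand, plus one digit for a possible carry.
    precision = max(lp - ls, rp - rs) + scale + 1
    return precision, scale


col = (36, 9)  # hypothetical column type decimal(36, 9)
result = col
for _ in range(3):
    # Each addition widens the inferred precision by one digit.
    result = promote_add(result, col)

print(result)  # (39, 9), which exceeds BQ_MAX_PRECISION
```

Under this assumed rule, three chained additions are enough to push the inferred precision past what the backend can represent, even though the stored values never change.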

ibis/expr/rules.py

cpcloud · 2024-02-02T10:31:58Z

@webmiche Thoughts on dropping the type evolution here? It seems to be causing an issue for BigQuery which has much simpler promotion rules than DB2.

kszucs · 2024-02-02T12:22:04Z

@chelsea-lin I rebased your branch and retargeted the PR against main.

webmiche · 2024-02-02T14:55:37Z

> @webmiche Thoughts on dropping the type evolution here? It seems to be causing an issue for BigQuery which has much simpler promotion rules than DB2.

For context, I have not been working a lot with databases and ibis lately given that my project finished.

I think this touches the exact core discussion we had back in the initial PR: type inference rules are backend-specific whereas ibis tries to build data structures that are as generic as possible. And I guess what we have here is exactly that, a backend that does not follow the same type inference rules.

I think there is a range of possibilities of what could be done for this case. Off the top of my head:

1. infer type rules according to the backend
2. define ibis-specific type rules and add glue code for backends
3. make decimals "opaque" to some extent, i.e. make them not have a precision and scale at all (at least in some cases where this info is not directly encoded by the frontend)

I personally advise against inferring type inference rules from backend choice as this somewhat defeats the purpose of ibis in the first place. Also, the ibis specific type rules sound like just another specification on top of all the others, so that's no real solution.

I think the third choice is probably the most reasonable. Honestly, I don't know exactly how the ibis data structures work currently, so I can't tell the ramifications of such a change. Maybe deleting the type inference/binop promotion functionality already achieves this to some extent. Maybe this change is only the first step towards handling this in a principled way.

So, all in all, I am fine with dropping it. If I were still working on the project I originally implemented the type inference for, I would probably add some glue code to handle this in between the ibis data structures and my data structures. For this, I would definitely appreciate it if ibis communicated to me which decimals are defined by the frontend and which it is not sure about. The "opaque" decimal solution would achieve this, AFAIU.
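
A minimal sketch of what I mean by "opaque" decimals plus glue code, in plain Python. Everything here is hypothetical and only illustrates the idea: the `Decimal` class, `add_type`, and `resolve_for_backend` are made-up names, not the ibis data structures, and the backend default of (38, 9) is an assumption for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decimal:
    """Hypothetical decimal type; None means "not specified by the frontend"."""

    precision: Optional[int] = None
    scale: Optional[int] = None

    @property
    def is_opaque(self) -> bool:
        return self.precision is None or self.scale is None


def add_type(lhs: Decimal, rhs: Decimal) -> Decimal:
    # No promotion rule: the result of arithmetic is simply left opaque.
    return Decimal()


def resolve_for_backend(dtype: Decimal, default: Decimal) -> Decimal:
    # Glue code: keep frontend-declared types, fill in opaque ones.
    return default if dtype.is_opaque else dtype


declared = Decimal(10, 2)                # defined by the frontend
derived = add_type(declared, declared)   # inferred, so opaque
backend_default = Decimal(38, 9)         # assumed backend default

print(resolve_for_backend(declared, backend_default))
print(resolve_for_backend(derived, backend_default))
```

The point of the design is that the consumer can tell declared decimals from inferred ones and apply its own rules only to the latter.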

Maybe @ingomueller-net has more thoughts on this.

cpcloud · 2024-02-02T17:31:02Z

@webmiche Thanks for the response! I was thinking something similar re opaque decimals. We took this approach a long time ago with strings, and it has not proven to be an issue while also allowing us to mostly ignore the various CHAR(N)/VARCHAR(N) string types.

I think we'd have to implement this and try it out on a few backends to see what its blast radius would be. I suspect there might be some issues at the Ibis <-> pyarrow boundary.

chelsea-lin · 2024-02-02T17:48:29Z

> @chelsea-lin I rebased your branch and retargeted the PR against main.

@kszucs I also need the commit in the-epic-split branch. What's the process to do that?

kszucs · 2024-02-05T14:47:05Z

We keep the-epic-split rebased on top of the main branch, so this change will reach both branches.

cpcloud · 2024-02-05T21:29:34Z

Given this isn't breaking anything I'm inclined to merge it. @kszucs Thoughts?

kszucs

LGTM, thanks @chelsea-lin!

kszucs · 2024-02-05T23:02:45Z

> Given this isn't breaking anything I'm inclined to merge it. @kszucs Thoughts?

Agree.

ingomueller-net · 2024-02-06T13:32:02Z

Thanks, everybody, for the work on Ibis and this issue!

What I have asked myself in the past due to this and similar issues is: what is the degree of portability that Ibis aims for? Since the precision promotion logic is different in some backends, Ibis programs may produce different results when run on different backends. Is that expected? Is there a place where these differences are tracked? When do we consider a backend as "correct" if we allow these kinds of differences?

cpcloud · 2024-02-06T16:08:49Z

@ingomueller-net Good questions!

> what is the degree of portability that Ibis aims for?

I don't have a general and precise answer for this.

Here are some aspects of Ibis's output that aren't portable across backends and likely never will be:

- Aggregations involving floating point arithmetic (any parallelization has the potential to defeat any expectation of exact equality)
- Decimal precision and scale are not uniformly supported across backends. For example, Postgres supports arbitrary precision decimals, while BigQuery supports up to 32 bytes with BIGDECIMAL.
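
As a small illustration of the first point, here is a plain-Python sketch (not tied to any backend) showing that regrouping a floating point sum, as a parallel engine might when combining partial sums, can change the result:

```python
values = [1e16, 1.0, -1e16]

# Left-to-right sum: the 1.0 is lost when it is added to 1e16.
sequential = (values[0] + values[1]) + values[2]

# A different grouping, e.g. the two large values meet in one partial sum first.
regrouped = (values[0] + values[2]) + values[1]

print(sequential, regrouped)  # 0.0 1.0
```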

I'm sure there are others.

> Is that expected?

Depends. Generally speaking, no, but there's functionality that is out of our control and cannot be replicated across backends.

> Is there a place where these differences are tracked?

At least I can answer this one definitively: no there is not :)

Perhaps a docs page with specific details here is in order!

> When do we consider a backend as "correct" if we allow these kinds of differences?

It would be difficult to do without a definition of correctness. I'd like to see if there's a way we can capture the backend differences first, and then see whether such a definition comes out of that.

ingomueller-net · 2024-02-06T16:35:47Z

Thanks for the prompt answer, @cpcloud!

> Aggregations involving floating point arithmetic (any parallelization has the potential to defeat any expectation of exact equality)

That's a good one! And somewhat tricky: one could argue that any possible order is "correct" by accepting any possible ordering (like the SQL standard does, I believe).

[ Off topic: There are actually quite a number of sources of non-determinism. I just now remembered that, in some previous life, I wrote a paper about them as well as a potential solution, which, however, I think no real system ever implemented ;) ]

> I'm sure there are others.
> [...]
>
> > Is there a place where these differences are tracked?
>
> At least I can answer this one definitively: no there is not :)
>
> Perhaps a docs page with specific details here is in order!
>
> > When do we consider a backend as "correct" if we allow these kinds of differences?
>
> It would be difficult to do without a definition of correctness. I'd like to see if there's a way we can capture the backend differences first, and then see whether such a definition comes out of that.

+1

ingomueller-net · 2024-02-07T16:58:57Z

To add to the issue of aggregations, note that the result of any non-associative function depends on the input order, including first, last, array_agg (not sure if it exists), sum on floats, but also sum on integers in most cases (since whether or not an overflow occurs may depend on the order).
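
To make the integer case concrete, here is a rough Python sketch that emulates a checked 64-bit accumulator. The helper names `checked_sum` and `try_sum` are made up for the example, and real systems differ in whether they raise, wrap around, or widen the accumulator.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def checked_sum(values):
    """Sum left to right, failing like a checked int64 accumulator would."""
    total = 0
    for v in values:
        total += v
        if not INT64_MIN <= total <= INT64_MAX:
            raise OverflowError("int64 overflow")
    return total


def try_sum(values):
    """Return the sum, or the string "overflow" if the accumulator overflows."""
    try:
        return checked_sum(values)
    except OverflowError:
        return "overflow"


# Same multiset of inputs, different order, different outcome.
print(try_sum([INT64_MAX, -1, 1]))  # 9223372036854775807
print(try_sum([INT64_MAX, 1, -1]))  # overflow
```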

And then the result of most window functions obviously depends on the order.

Again, all of these are non-deterministic in many SQL systems, so you don't get a guarantee to obtain the same result from run to run against the same system instance there either. I, thus, think it'd be perfectly valid to declare all possible orderings as "correct." However, it has to be clear what exactly the semantics of each operation are, possibly referring to non-determinism for some operations.

ingomueller-net · 2024-02-07T17:00:55Z

Oh, and another point that came up when I asked the same question on substrait: The semantics of ORDER BY aren't the same in every system, for example, for NaNs which could come before or after numbers.
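
A quick sketch of the two conventions in plain Python; `order_by` is a hypothetical helper for illustration, not an Ibis or Substrait function:

```python
import math

values = [2.0, float("nan"), 1.0]


def order_by(xs, nans_first):
    """Sort ascending, placing NaNs before or after all numbers."""

    def key(x):
        is_nan = math.isnan(x)
        # The rank decides which end NaNs go to; 0.0 is a placeholder so
        # that NaN itself is never compared.
        rank = (not is_nan) if nans_first else is_nan
        return (rank, 0.0 if is_nan else x)

    return sorted(xs, key=key)


sorted_nans_last = order_by(values, nans_first=False)
sorted_nans_first = order_by(values, nans_first=True)

print(sorted_nans_last)   # [1.0, 2.0, nan]
print(sorted_nans_first)  # [nan, 1.0, 2.0]
```

Both outputs are valid ascending orders; they only disagree on where NaN belongs.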

kszucs reviewed Feb 2, 2024

ibis/expr/rules.py

kszucs force-pushed the the-epic-split branch 2 times, most recently from 432d151 to e4df99b Compare February 2, 2024 11:42

kszucs force-pushed the chelsealin_numeric branch from 90aad0e to e7d3900 Compare February 2, 2024 12:21

kszucs changed the base branch from the-epic-split to main February 2, 2024 12:21

chelsea-lin force-pushed the chelsealin_numeric branch from e7d3900 to 838b242 Compare February 2, 2024 17:27

chelsea-lin changed the base branch from main to the-epic-split February 2, 2024 17:29

chelsea-lin force-pushed the chelsealin_numeric branch from 838b242 to 8cccf5b Compare February 2, 2024 17:45

chelsea-lin changed the base branch from the-epic-split to main February 2, 2024 17:46

chelsea-lin force-pushed the chelsealin_numeric branch from 436ef1b to a64622e Compare February 5, 2024 17:50

chelsea-lin changed the title ~~fix: remove the decimal precision promotion logic from ibis.expr~~ fix: remove the decimal precision promotion logic Feb 5, 2024

chelsea-lin force-pushed the chelsealin_numeric branch 2 times, most recently from cec4807 to d20f5de Compare February 5, 2024 20:11

kszucs force-pushed the chelsealin_numeric branch 2 times, most recently from 794184d to 5ee75c8 Compare February 5, 2024 23:00

refactor(ir): remove the decimal precision promotion logic

c46d5b6

kszucs force-pushed the chelsealin_numeric branch from 5ee75c8 to c46d5b6 Compare February 5, 2024 23:01

kszucs approved these changes Feb 5, 2024

kszucs changed the title ~~fix: remove the decimal precision promotion logic~~ refactor(ir): remove the decimal precision promotion logic Feb 5, 2024

kszucs enabled auto-merge (rebase) February 5, 2024 23:03

kszucs merged commit 0db3ec7 into ibis-project:main Feb 5, 2024
82 checks passed

ingomueller-net mentioned this pull request Feb 6, 2024

What is the level of targeted portability of Substrait? substrait-io/substrait#596

Open
